ol regression
AI Meets the Classroom: When Does ChatGPT Harm Learning?
Lehmann, Matthias, Cornelius, Philipp B., Sting, Fabian J.
In this paper, we study how generative AI and specifically large language models (LLMs) impact learning in coding classes. We show across three studies that LLM usage can have positive and negative effects on learning outcomes. Using observational data from university-level programming courses, we establish such effects in the field. We replicate these findings in subsequent experimental studies, which closely resemble typical learning scenarios, to show causality. We find evidence for two contrasting mechanisms that determine the overall effect of LLM usage on learning. Students who use LLMs as personal tutors by conversing about the topic and asking for explanations benefit from usage. However, learning is impaired for students who excessively rely on LLMs to solve practice exercises for them and thus do not invest sufficient own mental effort. Those who never used LLMs before are particularly prone to such adverse behavior. Students without prior domain knowledge gain more from having access to LLMs. Finally, we show that the self-perceived benefits of using LLMs for learning exceed the actual benefits, potentially resulting in an overestimation of one's own abilities. Overall, our findings show promising potential of LLMs as learning support, however also that students have to be very cautious of possible pitfalls.
Estimating Causal Effects with Double Machine Learning -- A Method Evaluation
Fuhr, Jonathan, Berens, Philipp, Papies, Dominik
The estimation of causal effects with observational data continues to be a very active research area. In recent years, researchers have developed new frameworks which use machine learning to relax classical assumptions necessary for the estimation of causal effects. In this paper, we review one of the most prominent methods - "double/debiased machine learning" (DML) - and empirically evaluate it by comparing its performance on simulated data relative to more traditional statistical methods, before applying it to real-world data. Our findings indicate that the application of a suitably flexible machine learning algorithm within DML improves the adjustment for various nonlinear confounding relationships. This advantage enables a departure from traditional functional form assumptions typically necessary in causal effect estimation. However, we demonstrate that the method continues to critically depend on standard assumptions about causal structure and identification. When estimating the effects of air pollution on housing prices in our application, we find that DML estimates are consistently larger than estimates of less flexible methods. From our overall results, we provide actionable recommendations for specific choices researchers must make when applying DML in practice.
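As an illustration of the idea this abstract describes, the following is a minimal sketch of the partialling-out form of DML with cross-fitting. It is not the authors' code or evaluation setup. The simulated data, the function names (`dml_partial_out`, `fit_predict`), and the use of a polynomial least-squares fit as the "flexible" nuisance learner are all assumptions made for this example.

```python
import numpy as np

def poly_features(x, degree=4):
    # Columns 1, x, x^2, ..., x^degree for a one-dimensional confounder.
    return np.vander(x, degree + 1, increasing=True)

def fit_predict(x_train, target_train, x_test, degree=4):
    # Polynomial least squares as a stand-in for a flexible ML learner (assumption).
    coef, *_ = np.linalg.lstsq(poly_features(x_train, degree), target_train, rcond=None)
    return poly_features(x_test, degree) @ coef

def dml_partial_out(y, d, x, n_folds=5, seed=0):
    # Cross-fitting: nuisance predictions for each fold come from models
    # trained on the remaining folds only.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_res = np.empty_like(y)
    d_res = np.empty_like(d)
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        y_res[test_idx] = y[test_idx] - fit_predict(x[train], y[train], x[test_idx])
        d_res[test_idx] = d[test_idx] - fit_predict(x[train], d[train], x[test_idx])
    # Final stage: regress outcome residuals on treatment residuals.
    return float(d_res @ y_res / (d_res @ d_res))

# Simulated data (hypothetical): x confounds d and y through x^2; true effect is 1.5.
rng = np.random.default_rng(42)
n = 4000
x = rng.uniform(-2, 2, n)
d = x**2 + rng.normal(0, 1, n)
y = 1.5 * d + 2 * x**2 + rng.normal(0, 1, n)

theta_dml = dml_partial_out(y, d, x)

# Benchmark: OLS of y on d and x with a purely linear adjustment for x.
design = np.column_stack([np.ones(n), d, x])
theta_lin = float(np.linalg.lstsq(design, y, rcond=None)[0][1])

print(f"DML estimate: {theta_dml:.3f}, linear-adjustment estimate: {theta_lin:.3f}")
```

In this toy setup the linear adjustment cannot absorb the quadratic confounding, while the flexible nuisance fit can. Both estimators still rely on x being the only confounder, which matches the abstract's point that DML relaxes functional form assumptions but not identification assumptions.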
Can you Deep Learn the Stock Market? "Honestly," no
You can find many examples of Deep Neural Network (DNN) models that successfully forecast the stock market. Typically, these models use a very short time frequency. As input variables, these DNN models use a number of other stock indices that correlate with the S&P 500. They often use autoregressive variables (most recent S&P 500 levels). The mentioned high-frequency trading DNN models use covariates, or variables that lack any explanatory logic besides being correlated with the S&P 500 (or whatever stock they predict). Let's step back and differentiate between covariates and explanatory variables because this is at the essence of my effort to Deep Learn the stock market "honestly." The mentioned "successful" high-frequency trading DNN models use covariates, or variables that lack any true exogenous explanatory logic regarding the behavior of the S&P 500. Stating that the S&P 500 moves in tandem with the Nikkei 225 is not explanatory per se. It just exploits a tautological correlation.
Exploring Topic-Metadata Relationships with the STM: A Bayesian Approach
Schulze, P., Wiegrebe, S., Thurner, P. W., Heumann, C., Aßenmacher, M., Wankmüller, S.
Topic models such as the Structural Topic Model (STM) estimate latent topical clusters within text. An important step in many topic modeling applications is to explore relationships between the discovered topical structure and metadata associated with the text documents. Methods used to estimate such relationships must take into account that the topical structure is not directly observed, but is instead itself estimated. The authors of the STM, for instance, perform repeated OLS regressions of sampled topic proportions on metadata covariates by using a Monte Carlo sampling technique known as the method of composition. In this paper, we propose two improvements: first, we replace OLS with more appropriate Beta regression. Second, we suggest a fully Bayesian approach instead of the current blending of frequentist and Bayesian methods. We demonstrate our improved methodology by exploring relationships between Twitter posts by German members of parliament (MPs) and different metadata covariates.
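The repeated-OLS procedure mentioned in this abstract can be sketched roughly as follows. This is a simplified illustration, not the STM implementation or the authors' proposal. The function name `method_of_composition`, the clipped normal used as a stand-in for the topic-proportion posterior, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def method_of_composition(theta_mean, theta_sd, X, n_draws=200, seed=0):
    # theta_mean, theta_sd: per-document posterior mean and sd of one topic's
    # proportion (hypothetical inputs). X: design matrix including an intercept.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    draws = np.empty((n_draws, p))
    for s in range(n_draws):
        # Step 1: sample topic proportions from their approximate posterior.
        # A clipped normal is an assumption used here for simplicity.
        theta = np.clip(rng.normal(theta_mean, theta_sd), 1e-6, 1 - 1e-6)
        # Step 2: OLS regression of the sampled proportions on the covariates.
        beta_hat = xtx_inv @ X.T @ theta
        resid = theta - X @ beta_hat
        sigma2 = resid @ resid / (n - p)
        # Step 3: sample coefficients from their approximate normal distribution.
        draws[s] = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
    # Pooled draws reflect uncertainty in both the topics and the regression.
    return draws

# Simulated example (hypothetical): one binary covariate shifts the topic proportion.
rng = np.random.default_rng(7)
n_docs = 500
z = rng.integers(0, 2, n_docs).astype(float)
X = np.column_stack([np.ones(n_docs), z])
theta_mean = 0.2 + 0.1 * z
theta_sd = np.full(n_docs, 0.02)

draws = method_of_composition(theta_mean, theta_sd, X)
slope_mean = float(draws[:, 1].mean())
slope_interval = np.quantile(draws[:, 1], [0.025, 0.975])
print(slope_mean, slope_interval)
```

Because OLS does not respect the (0, 1) range of topic proportions, a regression designed for proportions, such as the Beta regression the abstract proposes, would replace Step 2 in such a scheme.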
The Simpler Brother of OLS Regression for Machine Learning
Nonparametrics took me a while to get my head around. On the one hand, all I had ever studied involved making the formulae of a predictive model differentiable and optimizing with regard to an individual parameter or a set of parameters (think linear regression or GMM). On the other hand, the majority of the nonparametric methods were being used in classification (Random Forests, KNN, etc). But some of the best methods are nonparametric. They do not assume a particular family of distributions and try to select the best-fit one; they make judgments without assuming a distribution.
Estimating Treatment Effects with Observed Confounders and Mediators
Gupta, Shantanu, Lipton, Zachary C., Childers, David
Given a causal graph, the do-calculus can express treatment effects as functionals of the observational joint distribution that can be estimated empirically. Sometimes the do-calculus identifies multiple valid formulae, prompting us to compare the statistical properties of the corresponding estimators. For example, the backdoor formula applies when all confounders are observed and the frontdoor formula applies when an observed mediator transmits the causal effect. In this paper, we investigate the over-identified scenario where both confounders and mediators are observed, rendering both estimators valid. Addressing the linear Gaussian causal model, we derive the finite-sample variance for both estimators and demonstrate that either estimator can dominate the other by an unbounded constant factor depending on the model parameters. Next, we derive an optimal estimator, which leverages all observed variables to strictly outperform the backdoor and frontdoor estimators. We also present a procedure for combining two datasets, with confounders observed in one and mediators in the other. Finally, we evaluate our methods on both simulated data and the IHDP and JTPA datasets.
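To make the over-identified setting concrete, here is a minimal sketch of regression-based backdoor and frontdoor estimates in a simulated linear Gaussian model. It is not the authors' code and does not implement their optimal estimator. The variable names, the path coefficients, and the helper `ols_slopes` are assumptions made for this example.

```python
import numpy as np

def ols_slopes(features, target):
    # OLS with an intercept; returns the slope coefficients only.
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]

# Hypothetical linear Gaussian model: W -> X, W -> Y, X -> M -> Y.
rng = np.random.default_rng(0)
n = 20000
a, c, b, d = 0.8, 0.5, 2.0, 1.0
w = rng.normal(size=n)                      # observed confounder
x = a * w + rng.normal(size=n)              # treatment
m = c * x + rng.normal(size=n)              # observed mediator
y = b * m + d * w + rng.normal(size=n)      # outcome
true_effect = c * b                         # effect of X on Y, transmitted via M

# Backdoor: adjust for the confounder, read off the coefficient on X.
backdoor = float(ols_slopes([x, w], y)[0])

# Frontdoor: (effect of X on M) times (effect of M on Y, holding X fixed).
c_hat = float(ols_slopes([x], m)[0])
b_hat = float(ols_slopes([m, x], y)[0])
frontdoor = c_hat * b_hat

print(f"true: {true_effect:.2f}, backdoor: {backdoor:.3f}, frontdoor: {frontdoor:.3f}")
```

Both estimates target the same quantity here. Which one has lower variance depends on the path coefficients and noise scales, which is the comparison the abstract describes.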
Essential Machine Learning with Linear Models in RAPIDS: Part 1 of a Series
I want to take a moment to tell the origin story of regression analysis, which will explain why it has that name. I believe that of all the common machine learning techniques (K-means, kNN, PCA), "regression analysis" has the most opaque name. OLS regression was first invented to analyze exceptional genetic traits and their heritability. These early studies seemed to show the offspring of exceptional individuals "regressed to the mean". The inventor was Sir Francis Galton (half-cousin of Charles Darwin), who had previously invented the standard deviation and first observed the "wisdom of the crowds" in certain estimation tasks. I am trying to predict daily demand for short-term bike rentals made in 2012, and I have data from 2011 to build the model.
Essential Machine Learning with Linear Models in RAPIDS: part 1 of a series.
This blog is the first in a series about regression analysis in RAPIDS, an open GPU data science platform. There are many varieties of regression techniques, and we're working to include them all in RAPIDS. In this blog edition, I use Ordinary Least Squares (OLS) and Ridge regression to choose a model to predict Washington, D.C. bikeshare rentals. I want to take a moment to tell the origin story of regression analysis, which will explain why it has that name. I believe that of all the common machine learning techniques (K-means, kNN, PCA), "regression analysis" has the most opaque name.
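The choice between OLS and Ridge mentioned above can be sketched with closed-form solutions in plain NumPy. This is not the blog's RAPIDS code and does not use the bikeshare data. The synthetic features, the candidate penalty values, and the helpers `fit_linear` and `rmse` are assumptions made for this example.

```python
import numpy as np

def fit_linear(X, y, alpha=0.0):
    # alpha = 0 gives OLS; alpha > 0 gives ridge regression.
    # Centering keeps the intercept out of the penalty.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, float(y_mean - x_mean @ coef)

def rmse(X, y, coef, intercept):
    return float(np.sqrt(np.mean((X @ coef + intercept - y) ** 2)))

# Synthetic stand-in for "fit on one year, predict the next" (hypothetical data).
rng = np.random.default_rng(1)
n_train, n_test, p = 365, 365, 8
X = rng.normal(size=(n_train + n_test, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n_train + n_test)  # near-collinear pair
true_coef = rng.normal(size=p)
y = 100 + X @ true_coef + rng.normal(scale=2.0, size=n_train + n_test)

X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Fit each candidate on the first period and score it on the second.
results = {}
for alpha in (0.0, 1.0, 10.0):
    coef, intercept = fit_linear(X_train, y_train, alpha)
    results[alpha] = (rmse(X_test, y_test, coef, intercept), float(np.linalg.norm(coef)))

best_alpha = min(results, key=lambda a: results[a][0])
for alpha, (err, norm) in results.items():
    print(f"alpha={alpha:>4}: held-out RMSE={err:.3f}, coefficient norm={norm:.3f}")
print("selected alpha:", best_alpha)
```

The ridge penalty shrinks the coefficient vector, which tends to help most when features are strongly correlated. Whether it improves held-out error on a given dataset is an empirical question.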